A General Learning Method for Automatic Title Extraction from HTML Pages
Internal identifier: 000831 (Main/Exploration); previous: 000830; next: 000832
Authors: Sahar Changuel [France]; Nicolas Labroche [France]; Bernadette Bouchon-Meunier [France]
Source:
- Lecture Notes in Computer Science [0302-9743]; 2009.
Abstract
This paper addresses the problem of automatically learning the title metadata from HTML documents. The objective is to help index Web resources that are poorly annotated. Other works have pursued similar objectives, but they considered only titles in text format. In this paper we propose a general learning schema that learns textual titles from style information and image-format titles from image properties. We construct features from automatically annotated pages harvested from the Web; this paper details the corpus creation method as well as the information extraction techniques. Based on these features, learning algorithms such as Decision Trees and Random Forests are applied, achieving good results despite the heterogeneity of our corpus. We also show that combining both methods can yield better performance.
DOI: 10.1007/978-3-642-03070-3_53
Links to previous steps (curation, corpus...)
- to stream Istex, to step Corpus: 000396
- to stream Istex, to step Curation: 000390
- to stream Istex, to step Checkpoint: 000353
- to stream Main, to step Merge: 000839
- to stream Main, to step Curation: 000831
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct:series"><teiHeader><fileDesc><titleStmt><title xml:lang="en">A General Learning Method for Automatic Title Extraction from HTML Pages</title>
<author><name sortKey="Changuel, Sahar" sort="Changuel, Sahar" uniqKey="Changuel S" first="Sahar" last="Changuel">Sahar Changuel</name>
</author>
<author><name sortKey="Labroche, Nicolas" sort="Labroche, Nicolas" uniqKey="Labroche N" first="Nicolas" last="Labroche">Nicolas Labroche</name>
</author>
<author><name sortKey="Bouchon Meunier, Bernadette" sort="Bouchon Meunier, Bernadette" uniqKey="Bouchon Meunier B" first="Bernadette" last="Bouchon-Meunier">Bernadette Bouchon-Meunier</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:D4D1E3040C032904E47DC6D9E7209FF37CE927F5</idno>
<date when="2009" year="2009">2009</date>
<idno type="doi">10.1007/978-3-642-03070-3_53</idno>
<idno type="url">https://api.istex.fr/document/D4D1E3040C032904E47DC6D9E7209FF37CE927F5/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000396</idno>
<idno type="wicri:Area/Istex/Curation">000390</idno>
<idno type="wicri:Area/Istex/Checkpoint">000353</idno>
<idno type="wicri:doubleKey">0302-9743:2009:Changuel S:a:general:learning</idno>
<idno type="wicri:Area/Main/Merge">000839</idno>
<idno type="wicri:Area/Main/Curation">000831</idno>
<idno type="wicri:Area/Main/Exploration">000831</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">A General Learning Method for Automatic Title Extraction from HTML Pages</title>
<author><name sortKey="Changuel, Sahar" sort="Changuel, Sahar" uniqKey="Changuel S" first="Sahar" last="Changuel">Sahar Changuel</name>
<affiliation wicri:level="3"><country xml:lang="fr">France</country>
<wicri:regionArea>Laboratoire d’Informatique de Paris 6 (LIP6), DAPA, LIP6, 104, Avenue du Président Kennedy, 75016, Paris</wicri:regionArea>
<placeName><region type="region" nuts="2">Île-de-France</region>
<settlement type="city">Paris</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
</affiliation>
</author>
<author><name sortKey="Labroche, Nicolas" sort="Labroche, Nicolas" uniqKey="Labroche N" first="Nicolas" last="Labroche">Nicolas Labroche</name>
<affiliation wicri:level="3"><country xml:lang="fr">France</country>
<wicri:regionArea>Laboratoire d’Informatique de Paris 6 (LIP6), DAPA, LIP6, 104, Avenue du Président Kennedy, 75016, Paris</wicri:regionArea>
<placeName><region type="region" nuts="2">Île-de-France</region>
<settlement type="city">Paris</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
</affiliation>
</author>
<author><name sortKey="Bouchon Meunier, Bernadette" sort="Bouchon Meunier, Bernadette" uniqKey="Bouchon Meunier B" first="Bernadette" last="Bouchon-Meunier">Bernadette Bouchon-Meunier</name>
<affiliation wicri:level="3"><country xml:lang="fr">France</country>
<wicri:regionArea>Laboratoire d’Informatique de Paris 6 (LIP6), DAPA, LIP6, 104, Avenue du Président Kennedy, 75016, Paris</wicri:regionArea>
<placeName><region type="region" nuts="2">Île-de-France</region>
<settlement type="city">Paris</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<imprint><date>2009</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">D4D1E3040C032904E47DC6D9E7209FF37CE927F5</idno>
<idno type="DOI">10.1007/978-3-642-03070-3_53</idno>
<idno type="ChapterID">53</idno>
<idno type="ChapterID">Chap53</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: This paper addresses the problem of automatically learning the title metadata from HTML documents. The objective is to help index Web resources that are poorly annotated. Other works have pursued similar objectives, but they considered only titles in text format. In this paper we propose a general learning schema that learns textual titles from style information and image-format titles from image properties. We construct features from automatically annotated pages harvested from the Web; this paper details the corpus creation method as well as the information extraction techniques. Based on these features, learning algorithms such as Decision Trees and Random Forests are applied, achieving good results despite the heterogeneity of our corpus. We also show that combining both methods can yield better performance.</div>
</front>
</TEI>
<affiliations><list><country><li>France</li>
</country>
<region><li>Île-de-France</li>
</region>
<settlement><li>Paris</li>
</settlement>
</list>
<tree><country name="France"><region name="Île-de-France"><name sortKey="Changuel, Sahar" sort="Changuel, Sahar" uniqKey="Changuel S" first="Sahar" last="Changuel">Sahar Changuel</name>
</region>
<name sortKey="Bouchon Meunier, Bernadette" sort="Bouchon Meunier, Bernadette" uniqKey="Bouchon Meunier B" first="Bernadette" last="Bouchon-Meunier">Bernadette Bouchon-Meunier</name>
<name sortKey="Bouchon Meunier, Bernadette" sort="Bouchon Meunier, Bernadette" uniqKey="Bouchon Meunier B" first="Bernadette" last="Bouchon-Meunier">Bernadette Bouchon-Meunier</name>
<name sortKey="Changuel, Sahar" sort="Changuel, Sahar" uniqKey="Changuel S" first="Sahar" last="Changuel">Sahar Changuel</name>
<name sortKey="Labroche, Nicolas" sort="Labroche, Nicolas" uniqKey="Labroche N" first="Nicolas" last="Labroche">Nicolas Labroche</name>
<name sortKey="Labroche, Nicolas" sort="Labroche, Nicolas" uniqKey="Labroche N" first="Nicolas" last="Labroche">Nicolas Labroche</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000831 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000831 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= ISTEX:D4D1E3040C032904E47DC6D9E7209FF37CE927F5 |texte= A General Learning Method for Automatic Title Extraction from HTML Pages }}
This area was generated with Dilib version V0.6.32.